Add ms.hdfs.reader.parse.json.strings for inlining JSON strings #93

Merged
avinas-kumar merged 6 commits into linkedin:master from vigy321:hdfs-reader-inline-json-strings
Apr 30, 2026

vigy321 commented Apr 22, 2026

Problem

HdfsReader.selectFieldsFromGenericRecord serializes Avro string fields as JSON string primitives via JsonObject.addProperty, which double-encodes pre-serialized JSON payloads when the output is used as an outbound HTTP request body. The ARRAY and RECORD branches already inline via gson.fromJson(...) + jsonObject.add(...); the STRING/UNION branch has no equivalent.

Use case

JSON-LD payloads use @context / @type field names, which are not valid Avro identifiers, so a structured Avro record cannot represent them. The workaround is to store a pre-serialized JSON-LD document as a string field in Avro, but DIL currently escapes such strings when emitting to an HTTP POST body, producing malformed payloads.
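The double-encoding problem can be illustrated without Gson or Avro. The sketch below is a minimal, hand-written stand-in: `quote`, `asStringPrimitive`, and `inlined` are hypothetical helpers invented for this illustration, not DIL or Gson APIs, and `quote` only escapes the characters this example needs. It contrasts a pre-serialized payload emitted as a string primitive with the same payload embedded as a JSON value.

```java
class DoubleEncodingDemo {
    // Hand-written JSON string quoting, standing in for what a JSON library
    // does when a value is added as a string primitive (assumption: only
    // double quotes and backslashes need escaping in this example).
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }

    // The payload is treated as text: escaped and wrapped in quotes.
    static String asStringPrimitive(String field, String payload) {
        return "{" + quote(field) + ":" + quote(payload) + "}";
    }

    // The payload is treated as already-valid JSON and embedded as a value.
    static String inlined(String field, String payload) {
        return "{" + quote(field) + ":" + payload + "}";
    }

    public static void main(String[] args) {
        String payload = "{\"@type\":\"Person\"}";
        // prints {"body":"{\"@type\":\"Person\"}"}
        System.out.println(asStringPrimitive("body", payload));
        // prints {"body":{"@type":"Person"}}
        System.out.println(inlined("body", payload));
    }
}
```

The first output is what a receiving HTTP endpoint sees when the string is escaped: a string that contains JSON, rather than a JSON object.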

Change

Adds a new boolean property ms.hdfs.reader.parse.json.strings (default false). When enabled, the STRING/UNION branch of selectFieldsFromGenericRecord attempts to parse string values that look like JSON ({…} or […]) via gson.fromJson(..., JsonElement.class) and inlines the result as a JsonElement. Parse failures fall back to the existing string-primitive behavior.

The implementation matches the style of the adjacent ARRAY and RECORD branches, which already use gson.fromJson(..., JsonArray.class) and gson.fromJson(..., JsonObject.class). Using the polymorphic JsonElement.class here lets the new branch handle both object-rooted and array-rooted JSON content without pre-classifying the input.
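The control flow described above (flag gate, shape check, parse attempt, fallback) might look roughly like the following. This is a sketch under stated assumptions, not the code in HdfsReader: `convert` is a hypothetical name, the `parser` argument stands in for `gson.fromJson(value, JsonElement.class)` so the example runs without Gson on the classpath, and the body of `looksLikeJson` is a guess at the shape check based only on the "{…} or […]" description.

```java
import java.util.function.Function;

class InlineJsonSketch {
    // Cheap shape check: after trimming, the value is wrapped in {...} or [...].
    // Assumption: the real helper may differ in detail.
    static boolean looksLikeJson(String value) {
        if (value == null) {
            return false;
        }
        String trimmed = value.trim();
        if (trimmed.length() < 2) {
            return false;
        }
        char first = trimmed.charAt(0);
        char last = trimmed.charAt(trimmed.length() - 1);
        return (first == '{' && last == '}') || (first == '[' && last == ']');
    }

    // Returns the parsed element when the flag is on and parsing succeeds;
    // otherwise returns the original string unchanged (the existing behavior).
    // `parser` is a stand-in for a real JSON parser that throws an unchecked
    // exception on malformed input.
    static Object convert(String value, boolean parseJsonStrings,
                          Function<String, Object> parser) {
        if (parseJsonStrings && looksLikeJson(value)) {
            try {
                return parser.apply(value);
            } catch (RuntimeException e) {
                // Parse failure: fall through to the string-primitive path.
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // Stub parser for the demo only: tags its input and rejects anything
        // containing "incomplete".
        Function<String, Object> stub = s -> {
            if (s.contains("incomplete")) {
                throw new IllegalArgumentException("malformed");
            }
            return "PARSED(" + s + ")";
        };
        System.out.println(convert("{\"a\":1}", true, stub));       // parsed
        System.out.println(convert("[{incomplete]", true, stub));   // falls back
        System.out.println(convert("{\"a\":1}", false, stub));      // flag off
        System.out.println(convert("plain text", true, stub));      // not JSON-shaped
    }
}
```

Keeping the shape check in front of the parser means plain-text values never pay for a parse attempt, and the try/catch keeps malformed input on the pre-existing path.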

Back-compatibility

  • Property default is false; existing jobs are unaffected.
  • New code path has a try/catch fallback to the existing behavior on parse failure.
  • No new dependencies (Gson is already on the classpath).

Tests

Added 8 TestNG cases in HdfsReaderTest (cdi-core/src/test/java/com/linkedin/cdi/util/HdfsReaderTest.java) using Whitebox.invokeMethod for the private method, matching the pattern in AvroExtractorTest:

Flag on:

  • JSON object → inlined as JsonObject
  • JSON array (JSON-LD shape with @context) → inlined as JsonArray
  • Plain text → preserved as JsonPrimitive
  • Malformed JSON (e.g. [{incomplete]) → falls back to JsonPrimitive
  • null value → JsonNull
  • Empty string → preserved as empty JsonPrimitive

Flag off (back-compat gate):

  • JSON-looking content stays as a JsonPrimitive (the regression the flag must not cause)
  • Plain text unchanged

All 8 tests pass locally on JDK 1.8 + Gradle 6.8.1 (matches CI environment).

Files changed

  • cdi-core/src/main/java/com/linkedin/cdi/configuration/PropertyCollection.java — register MSTAGE_HDFS_READER_PARSE_JSON_STRINGS
  • cdi-core/src/main/java/com/linkedin/cdi/util/HdfsReader.java — conditional inlining branch + looksLikeJson helper
  • cdi-core/src/test/java/com/linkedin/cdi/util/HdfsReaderTest.java — new test class
  • docs/parameters/ms.hdfs.reader.parse.json.strings.md — new property doc page

Documentation

New page at docs/parameters/ms.hdfs.reader.parse.json.strings.md describing behavior, default, and an example contrasting flag-off vs flag-on output.

E2E test validation

A snapshot test was run through a PR in guest-workflows-spark, which pointed to the snapshot version of dil-internal and passed the new flag as true:

https://github.com/linkedin-multiproduct/guest-workflows-spark/pull/362

Verified that the DAG ran successfully in guest-workflows-spark with the new flag set to true.

[Screenshot 2026-04-28 at 4:19:39 AM]

Confirmed with the Google POC that they can see the JSON object from the snapshot test.

[Screenshot 2026-04-28 at 4:17:17 AM]

HdfsReader.selectFieldsFromGenericRecord serializes Avro string fields
as JSON string primitives, which double-encodes pre-serialized JSON
payloads (e.g. JSON-LD whose @-prefixed names are not valid Avro
identifiers). When the new property is true, string values that parse
as JSON objects or arrays are inlined via JsonParser.parseString; parse
failures fall back to the existing primitive behavior. Default is false,
so existing jobs are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vigy321 marked this pull request as ready for review April 22, 2026 18:26
Vignesh Rao and others added 2 commits April 26, 2026 21:45
DIL pins Gson 2.6.2 (gradle/scripts/dependencyDefinitions.gradle:31),
which doesn't have the static JsonParser.parseString(String) introduced
in Gson 2.8.6. Switch to the instance method new JsonParser().parse(s),
which has identical semantics on this version and throws the same
JsonSyntaxException on malformed input. Caught by local test run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace new JsonParser().parse(decrypted) with gson.fromJson(decrypted,
JsonElement.class). Both calls go through the same Gson parser and
throw the same JsonSyntaxException on malformed input, but fromJson
matches the style of the adjacent ARRAY/RECORD branches, reuses the
existing gson field instead of allocating a JsonParser per record,
and avoids the @deprecated marker on JsonParser's instance method
in newer Gson versions. All 8 HdfsReaderTest cases continue to pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread cdi-core/src/test/java/com/linkedin/cdi/util/HdfsReaderTest.java
…hema path

Two review fixes from gautamshanu:

1. Add MSTAGE_HDFS_READER_PARSE_JSON_STRINGS to the allProperties list
   in PropertyCollection so it participates in startup validation
   alongside every other MSTAGE_* property.

2. Add a test that uses a nullable string schema (Avro UNION ["string",
   "null"]) to exercise the UNION branch of selectFieldsFromGenericRecord.
   Existing tests covered Schema.Type.STRING; production traffic for the
   target use case is UNION-typed, so this widens coverage to that path.

All 9 HdfsReaderTest cases pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread cdi-core/src/main/java/com/linkedin/cdi/util/HdfsReader.java Outdated
Comment thread cdi-core/src/main/java/com/linkedin/cdi/util/HdfsReader.java Outdated
Vignesh Rao and others added 2 commits April 29, 2026 08:50
- Rename `looksLikeJson` to `isValidJson` for clarity.
- Rename parameter `s` to `value` and local `t` to `trimmed`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The JFrog build-info-extractor-gradle plugin on the buildscript classpath
was declared as `latest.release`, which now resolves to 6.0.4. Versions
6.0.0+ depend on jackson-databind:2.15.4, transitively pulling
jackson-core:2.15.4, a multi-release JAR containing Java 17
(major version 61) class files. Gradle 6.8.1's classpath instrumenter
walks every class via a bundled ASM that does not understand Java 17
bytecode and fails with:

  Failed to create Jar file ~/.gradle/caches/jars-8/<hash>/jackson-core-2.15.4.jar
  Caused by: java.lang.IllegalArgumentException: Unsupported class file major version 61

5.2.5 is the last release on jackson-databind:2.14.1, which is Java 8
compatible and configures cleanly under Gradle 6.8.1.
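
For readers unfamiliar with the "major version 61" in the error above: every class file starts with the magic number 0xCAFEBABE followed by a minor and a major version, and major 52 corresponds to Java 8 while 61 corresponds to Java 17. The sketch below reads that header from a byte array; `ClassFileVersion` and its methods are hypothetical names for illustration and are unrelated to how Gradle or ASM perform the check.

```java
import java.nio.ByteBuffer;

class ClassFileVersion {
    // Reads the major version from the first 8 bytes of a class file:
    // u4 magic (0xCAFEBABE), u2 minor_version, u2 major_version, big-endian.
    static int majorVersion(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // skip minor_version
        return buf.getShort() & 0xFFFF; // major_version as unsigned
    }

    // Maps a class-file major version to its Java release (52 -> 8, 61 -> 17).
    static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) {
        // Synthetic header only, not a complete class file.
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 61};
        int major = majorVersion(header);
        System.out.println("major " + major + " = Java " + javaRelease(major));
    }
}
```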
Comment thread build.gradle
avinas-kumar merged commit 8f05003 into linkedin:master Apr 30, 2026
1 check passed
vigy321 mentioned this pull request Apr 30, 2026
3 tasks
avinas-kumar pushed a commit that referenced this pull request Apr 30, 2026
Cut release 0.2.119 covering:
- #93 (HDFS reader: ms.hdfs.reader.parse.json.strings flag)
- #94 (build-info-extractor-gradle pin to 5.2.5)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>